In [1]:
import nltk
from nltk.corpus import brown
As seen in the Chuang et al. paper and in the Manning and Schuetze chapter, there is a well-known part-of-speech based pattern defined by Justeson and Katz for identifying simple noun phrases that often works well for pulling out keyphrases.
Chuang et al. use this pattern: Technical Term T = (A | N)+ (N | C) | N
Below, please write a function to define a chunker using the RegexpParser as illustrated in the section Chunking with Regular Expressions. You'll need to revise the grammar rules shown there to match the pattern shown above. You can be liberal with your definition of what is meant by N here. Also, C refers to cardinal number, which is CD in the brown corpus.
In [2]:
grammar = r"""
T: {<JJ.*|N.*>+ <N.*|CD>|<N.*>}   # (A|N)+ (N|C) | N, with C = CD (cardinal number)
"""
t_term = nltk.RegexpParser(grammar)
In [3]:
nltk.help.brown_tagset("CD.*")
Below, please write a function to call the chunker, run it on some sentences, and then print out the results for those sentences.
For uniformity, please run it on sentences 100 through 104 from the tagged brown corpus news category.
Then extract out the phrases themselves using the subtree extraction technique shown in the Exploring Text Corpora section. (Note: Section 7.4 shows how to get to the actual words in the phrase by using the tree.leaves() command.)
In [4]:
sents = brown.tagged_sents(categories="news")[100:105]
In [5]:
parsed_sents = [t_term.parse(s) for s in sents]
In [6]:
tech_phrases = [[t for t in s.subtrees() if t.node=="T"] for s in parsed_sents]
In [7]:
tech_phrases
Out[7]:
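To get at the words themselves rather than the (word, tag) pairs, you can use tree.leaves() as mentioned above. A minimal sketch, assuming the tech_phrases list built in the previous cell:
In [ ]:
# Join the words of each chunk's leaves to get plain phrase strings.
[[" ".join(word for word, tag in t.leaves()) for t in sent_phrases]
 for sent_phrases in tech_phrases]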
For this next task, write a new version of the chunker, but this time change it in two ways: first, have it recognize proper nouns rather than the technical-term pattern above, and second, run it over your own text collection rather than the brown corpus.
Note that the second requirement means that you need to run a tagger over your personal text collection before you design the proper noun recognizer. You can use a pre-trained tagger or train your own on one of the existing tagged collections (brown, conll, or treebank).
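(For reference, the pre-trained route would look roughly like the sketch below; the example sentence is made up. Note that nltk.pos_tag returns Penn Treebank tags, where proper nouns are NNP/NNPS rather than the brown tag NP, so a chunker built on its output would need different tag names.)
In [ ]:
# Pre-trained alternative (sketch only; this notebook trains its own tagger below).
# nltk.pos_tag tags one tokenized sentence at a time, using Penn Treebank tags.
print nltk.pos_tag(nltk.word_tokenize("Governor Reagan debated President Carter in Cleveland."))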
Tagger: Your code for optionally training a tagger, and for running the tagger on your personal collection, goes here:
In [8]:
import corpii  # module that loads the personal text collection (presidential debates)
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
In [9]:
debates = nltk.clean_html(corpii.load_pres_debates().raw())
sents = sent_tokenizer.sentences_from_text(debates)
In [10]:
def build_backoff_tagger(train_sents):
    # Bigram tagger backing off to a unigram tagger, backing off to 'NN' as a default.
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    return t2

tagger = build_backoff_tagger(brown.tagged_sents())
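(As a rough sanity check on the tagger, one option is to hold out part of brown and evaluate on it; a sketch, with the 90/10 split chosen arbitrarily:)
In [ ]:
# Train on 90% of the brown sentences and score accuracy on the held-out 10%.
brown_sents = brown.tagged_sents()
cutoff = int(len(brown_sents) * 0.9)
eval_tagger = build_backoff_tagger(brown_sents[:cutoff])
print eval_tagger.evaluate(brown_sents[cutoff:])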
In [11]:
token_regex= """(?x)
# taken from ntlk book example
([A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():-_`] # these are separate tokens
"""
In [12]:
t_sents = [nltk.regexp_tokenize(s, token_regex) for s in sents]
In [13]:
tagged_sents = [tagger.tag(s) for s in t_sents]
Chunker: Code for the proper noun chunker goes here:
In [15]:
re_noun_chunk = r"""
NP: {<NP>+|<NP><IN.*|DT.*><NP>}
"""
np_parser = nltk.RegexpParser(re_noun_chunk)
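(A quick sanity check on a hand-tagged toy example with brown-style tags; the sentence is made up, just to confirm that the preposition case comes out as a single chunk:)
In [ ]:
# "Bank of America" should be grouped into one NP chunk by the rule above.
toy = [("Bank", "NP"), ("of", "IN"), ("America", "NP"),
       ("announced", "VBD"), ("a", "AT"), ("merger", "NN")]
print np_parser.parse(toy)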
Test the Chunker: Test your proper noun recognizer on a lot of sentences to see how well it is working. You might want to add prepositions in order to improve your results.
In [16]:
for i in range(0, 50):
    print "**********************************************"
    print sents[i]
    print "Proper Nouns:"
    print [t for t in np_parser.parse(tagged_sents[i]).subtrees() if t.node == "NP"]
FreqDist Results: After you have your proper noun recognizer working to your satisfaction, run it over your entire collection, feed the results into a FreqDist, and then print out the top 20 proper nouns by frequency. That code goes here, along with the output:
In [17]:
trees = [np_parser.parse(s) for s in tagged_sents]
pnouns = [i for t in trees for i in t.subtrees() if i.node=="NP"]
In [18]:
pn_freq = nltk.FreqDist([pn.pprint() for pn in pnouns])
In [19]:
pn_freq.items()[0:20]
Out[19]:
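(If you would rather count plain phrase strings than tree printouts, a variant sketch:)
In [ ]:
# Key the FreqDist on the joined words of each chunk instead of the tree representation.
pn_string_freq = nltk.FreqDist(" ".join(word for word, tag in pn.leaves()) for pn in pnouns)
pn_string_freq.items()[0:20]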
Just FYI, in Wednesday, October 8's assignment, you'll be asked to extend this code a bit more to discover interesting patterns using objects or subjects of verbs, and to do a bit of WordNet grouping. This will be posted soon. Note that these exercises are intended to provide you with functions to use directly in your larger assignment.